Zbiór pochodzi z publikacji w czasopismie Data in Brief.
Tak naprawdę są to dwa zbiory połączone w jeden. Dane dotyczą bowiem dwóch hoteli ulokowanych w Portugalii: jednego miejskiego (w Lizbonie), drugiego w typie kurortu (resort hotel, w regionie Algarve).
Zbiór zawiera informacje o rezerwacjach w obu hotelach w okresie od 1 lipca 2015 do 31 sierpnia 2017 roku.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
pd.options.mode.chained_assignment = None
df = pd.read_csv("data/hotel_bookings.csv")
pd.DataFrame(df.groupby("country")["is_canceled"].mean().sort_values()).head(20)
df[df["reserved_room_type"] == df["assigned_room_type"]]
df.loc[df["days_in_waiting_list"] > 0, "days_in_waiting_list"].hist(bins = 100)
df.groupby(["days_in_waiting_list"])["is_canceled"].mean()
df.groupby(["assigned_room_type"])["is_canceled"].mean()
# Spójrzmy, jak to wygląda
df
df.describe()
# Ile jest wierszy, jakie sÄ… kolumny, gdzie siedzÄ… nulle?
df.info()
W większości nazwy kolumn są samotłumaczące się, a te mniej oczywiste można sprawdzić we wspomnianej pracy. Ja raczej lubię wiedzieć, z czym pracuję, więc na przykład:
lead_time to liczba dni od wprowadzenia rezerwacji do systemu do dnia rozpoczęcia pobytu,distribution_channel i market_segment są związane ze sposobem rezerwacji (market_segment specyfikuje dokładniej distribution_channel),agent to ID agencji podróży, przez którą odbyła się rezerwacja, company to ID firmy/jednostki, która zarezerwowała pobyt lub jest odpowieddzialna za płatność, adr to Average Daily Rate = przychód z rezerwacji / liczba zarezerwowanych pokoi.# Czym się różni distribution_channel i market_segment?
# market_segment to doprecyzowanie distribution_channel
df[df["distribution_channel"] != df["market_segment"]][["distribution_channel","market_segment"]]
df[df["distribution_channel"] == df["market_segment"]][["distribution_channel","market_segment"]]
df["distribution_channel"].value_counts()
Trzeba wyrzucić reservation_status, bo dubluje informacje o is_canceled.
df["reservation_status"].value_counts()
#Jak dużo mamy nulli?
df.isna().sum()
# Czy bezpośrednie rezerwacje to te, które mają NULL agenta?
df[(df["distribution_channel"] == "Direct") & (df["agent"].isna())].shape[0]
# Nie wszystkie :(
# Okej, to trochę dziwne (?), ale część bezpośrednich rezerwacji było dokonywanych przez jakichś określonych agentów???
df[df["distribution_channel"] == "Direct"]["agent"].value_counts()
# 0 sÄ… wolnymi ID w obu przypadkach
(df["agent"] == 0).sum(), (df["company"] == 0).sum()
nan_replacements = {"children:": 0, "country": "Unknown", "agent": 0, "company": 0}
hotels_df = df.fillna(nan_replacements)
hotels_df
hotels_df[['lead_time', 'adr']].hist(figsize=(15,6), bins=80)
hotels_df[['stays_in_weekend_nights', 'stays_in_week_nights']].hist(figsize=(15,6), bins=10, range=(0,9))
hotels_df[['adults', 'children', 'babies']].hist(figsize=(15,12), bins=6, range=(0,5))
plt.show()
num_features = ["is_canceled", "lead_time","arrival_date_week_number","arrival_date_day_of_month",
"stays_in_weekend_nights","stays_in_week_nights","adults","children",
"babies","is_repeated_guest", "previous_cancellations", "booking_changes",
"previous_bookings_not_canceled", "required_car_parking_spaces",
"total_of_special_requests", "adr"]
plt.figure(figsize=(12, 10))
heatmap = sns.heatmap(hotels_df[num_features].corr(), annot=True, vmin=-1, vmax=1, cmap="BrBG")
plt.show()
Skoro mamy zajmować się przewidywaniem, czy dana rezerwacja zostanie odwołana, zobaczmy jak częsty jest to problem...
plt.figure(figsize=(12,8))
ax = sns.countplot(data = hotels_df, x = "hotel", hue = "is_canceled")
for p in ax.patches:
ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+100))
plt.title("Number of canceled and non canceled bookings", fontsize=16)
plt.xlabel("Hotel type", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.show()
Zaskakująco często.
deposit_cancel_data = hotels_df.groupby("deposit_type")["is_canceled"].describe()
plt.figure(figsize=(12, 8))
sns.barplot(x=deposit_cancel_data.index, y=deposit_cancel_data["mean"] * 100, color = "steelblue")
plt.title("Effect of deposit_type on cancelation", fontsize=16)
plt.xlabel("Deposit type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
hotels_df["deposit_type"].value_counts()
"Indication on if the customer made a deposit to guarantee the booking.
This variable can assume three categories:
hotels_df[hotels_df["deposit_type"] == "Non Refund"].groupby(['hotel', 'is_canceled']).size().reset_index()
bookings_monthly = hotels_df[["hotel", "arrival_date_month", "arrival_date_year", "is_canceled", "adr"]]
ordered_months = ["January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"]
bookings_monthly.loc[:,"arrival_date_month"] = pd.Categorical(bookings_monthly["arrival_date_month"], categories=ordered_months, ordered=True)
bookings_monthly = bookings_monthly.groupby(["hotel", "arrival_date_month", "is_canceled"]).size().reset_index(name='counts')
#w ramce dane za lipiec i sierpień występują 3 razy, za pozostałe miesiące - 2 razy
bookings_monthly.loc[(bookings_monthly["arrival_date_month"] == "July") | (bookings_monthly["arrival_date_month"] == "August"),
"counts"] /= 3
bookings_monthly.loc[~((bookings_monthly["arrival_date_month"] == "July") | (bookings_monthly["arrival_date_month"] == "August")),
"counts"] /= 2
plt.figure(figsize=(12, 8))
plt.ylim(0, 2500)
sns.lineplot(data=bookings_monthly[bookings_monthly["hotel"] == "City Hotel"], x = "arrival_date_month", y="counts", hue="is_canceled")
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Average number of bookings in City Hotel over the year", fontsize=16)
plt.xlabel("Arrival date month", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
plt.figure(figsize=(12, 8))
plt.ylim(0, 1500)
sns.lineplot(data=bookings_monthly[bookings_monthly["hotel"] == "Resort Hotel"], x = "arrival_date_month", y="counts", hue="is_canceled")
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Average number of bookings in Resort Hotel over the year", fontsize=16)
plt.xlabel("Arrival date month", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
df.country.value_counts()
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df[hotels_df["hotel"] == "City Hotel"], x = "country", hue = "is_canceled",
order = pd.value_counts(hotels_df['country']).iloc[:15].index)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Number of bookings in City Hotel by country of origin of guests", fontsize=16)
plt.xlabel("Country", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df[hotels_df["hotel"] == "Resort Hotel"], x = "country", hue = "is_canceled",
order = pd.value_counts(hotels_df['country']).iloc[:15].index)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.title("Number of bookings in Resort Hotel by country of origin of guests", fontsize=16)
plt.xlabel("Country", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
country_data = pd.DataFrame(hotels_df.loc[hotels_df["is_canceled"] == 0]["country"].value_counts())
country_data.rename(columns={"country": "Number of Guests"}, inplace=True)
total_guests = country_data["Number of Guests"].sum()
country_data["Guests in %"] = round(country_data["Number of Guests"] / total_guests * 100, 2)
country_data["country"] = country_data.index
guest_map = px.choropleth(country_data,
locations=country_data.index,
color=country_data["Guests in %"],
hover_name=country_data.index,
color_continuous_scale=px.colors.sequential.Plasma,
title="Origin country of guests")
guest_map.show()
leadtime = hotels_df.copy()
leadtime["bin"] = pd.qcut(leadtime["lead_time"], q = np.arange(0.0, 1.0, 0.1))
leadtime_plot = leadtime.groupby("bin")["is_canceled"].mean() * 100
leadtime_plot = leadtime_plot.reset_index()
plt.figure(figsize=(16, 8))
sns.barplot(x = "bin", y="is_canceled", data=leadtime_plot, color = "steelblue")
plt.title("Effect of lead time on cancelations", fontsize=16)
plt.xlabel("Lead time bin", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
plt.figure(figsize=(6, 8))
sns.boxplot(data = hotels_df, x = "is_canceled", y = "lead_time")
plt.show()
res = hotels_df.loc[(hotels_df["hotel"] == "Resort Hotel")]
city = hotels_df.loc[(hotels_df["hotel"] == "City Hotel")]
res["total_nights"] = res["stays_in_weekend_nights"] + res["stays_in_week_nights"]
city["total_nights"] = city["stays_in_weekend_nights"] + city["stays_in_week_nights"]
res_plot = res.groupby(['total_nights', "is_canceled"]).size().reset_index()
city_plot = city.groupby(['total_nights', "is_canceled"]).size().reset_index()
res_plot
plt.figure(figsize=(16, 8))
sns.barplot(x = "total_nights", y = 0, hue="is_canceled", data=city_plot)
plt.title("Bookings and length of stay in City Hotel", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.xlim(0,22)
plt.show()
plt.figure(figsize=(16, 8))
sns.barplot(x = "total_nights", y = 0, hue="is_canceled", data=res_plot)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Guests [%]", fontsize=16)
plt.xlim(0,22)
plt.title("Bookings and length of stay in Resort Hotel", fontsize=16)
plt.xlabel("Number of nights", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.show()
segments=hotels_df.groupby(["market_segment", "is_canceled"]).size().reset_index()
segments.sort_values(by = [0], ascending = False, inplace = True)
plt.figure(figsize=(16, 8))
sns.barplot(x = "market_segment", y = 0, hue="is_canceled", data=segments)
plt.title("Bookings by market segments", fontsize=16)
plt.xlabel("Market segment", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(loc="upper right", title="is canceled")
plt.show()
segments_pct = segments.pivot_table(index = "market_segment", columns = "is_canceled", values = 0)
segments_pct["pct_of_cancelations"] = segments_pct[1] / (segments_pct[1] + segments_pct[0])
segments_pct.sort_values(by = "pct_of_cancelations")
Widzimy chociażby, że rzadziej odwoływane są rezerwacje robione osobiście, bez pośrednictwa. Bardzo często natomiast anulowane są rezerwacje grupowe
plt.figure(figsize=(12, 8))
sns.countplot(data = hotels_df, x = "agent", hue = "is_canceled",
order = pd.value_counts(hotels_df['agent']).iloc[:15].index)
plt.title("Bookings by agents", fontsize=16)
plt.xlabel("Agent ID", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(title = "Booking status", labels = ["not canceled", "canceled"])
plt.show()
agents_bookings = hotels_df.groupby("agent")["is_canceled"].describe()
agents_bookings[["count", "mean"]].sort_values("count", ascending = False).head(20).sort_values("mean", ascending = False)
plt.figure(figsize=(8, 8))
sns.violinplot(data = hotels_df, x = "is_canceled", y = "adr")
plt.ylim([-100,1000])
plt.show()
city = hotels_df[hotels_df["hotel"] == "City Hotel"]
resort = hotels_df[hotels_df["hotel"] == "Resort Hotel"]
city["bin"] = pd.qcut(city["adr"], q = np.arange(0.0, 1.1, 0.1))
resort["bin"] = pd.qcut(resort["adr"], q = np.arange(0.0, 1.1, 0.1))
city = city.groupby(["bin", "is_canceled"])["hotel"].count().reset_index(name = "counts")
resort = resort.groupby(["bin", "is_canceled"])["hotel"].count().reset_index(name = "counts")
city2 = city.pivot_table(index = "bin", columns = "is_canceled", values = "counts").reset_index()
city2["pct_canceled"] = city2[1] / (city2[0] + city2[1])
resort2 = resort.pivot_table(index = "bin", columns = "is_canceled", values = "counts").reset_index()
resort2["pct_canceled"] = resort2[1] / (resort2[0] + resort2[1])
plt.figure(figsize=(16, 8))
sns.barplot(x = "bin", y = "counts", hue="is_canceled", data=city)
plt.title("Bookings in City Hotel by ADR", fontsize=16)
plt.xlabel("ADR", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(loc="upper right", title="is canceled")
plt.show()
city2
plt.figure(figsize=(16, 8))
sns.barplot(x = "bin", y = "counts", hue="is_canceled", data=resort)
plt.title("Bookings in Resort Hotel by ADR", fontsize=16)
plt.xlabel("ADR", fontsize=16)
plt.ylabel("Number of bookings", fontsize=16)
plt.legend(loc="upper right", title="is canceled")
plt.show()
resort2
hotels_df.loc[(hotels_df["reserved_room_type"] != hotels_df["assigned_room_type"]), "is_canceled"].mean()
hotels_df.loc[(hotels_df["reserved_room_type"] == hotels_df["assigned_room_type"]), "is_canceled"].mean()
Raczej nie bardzo
hotels_df.loc[hotels_df["babies"] + hotels_df["children"] > 0, "is_canceled"].mean()
hotels_df.loc[hotels_df["babies"] + hotels_df["children"] == 0, "is_canceled"].mean()
Też nie
hotels_df.loc[hotels_df["is_repeated_guest"] == 1, "is_canceled"].mean()
hotels_df.loc[hotels_df["is_repeated_guest"] == 0, "is_canceled"].mean()
TAK